Get your phones ready for the next slide, since I'll be sharing some helpful links. I'll show it again at the end.



Some Handy Links 


• https://github.com/internetarchive 

• https://archive.org/details/self 2016 -ia-apis 

• https://www.zotero.org/groups/internet_archive_-_open_apis_and_examples/items


* The Internet Archive github organization 

* The Zotero group where you can find all of the links and references we'll be mentioning 

* These slides are already available at this Internet Archive item 

Don't worry, we'll be showing these links again at the end of the presentation. 



A Tour of Internet 
Archive 




Located in SF but scanning centers all over the world. Old Christian Science church. Yes, it does look exactly like our logo. I'll neither confirm nor deny that's the reason why we bought it. I'll show pics of the 
inside as we go along. 



Internet Archive


501(c)(3) non-profit, founded in 1996 
Registered library in California, USA 
Partnerships with hundreds of libraries, 
museums & universities 
24 petabytes of unique data 






Non-profit Digital Library. Universal access to all knowledge. Collect in one place & make it available as freely as possible. Partners help with preservation & digitization. 24PB, all stored at least twice, + overhead == ~55PB of spinning disks.



Inside of the Great Room. 




Photo: Jason Scott 



Most FAQ, so get it out of the way. $14-15 million per year. ~40% from book scanning. ~20% from internet archiving. Web crawls: Nat'l Lib of NZ -> crawl every .nz. All those crawls go into the Wayback Machine. The rest from foundations & donations.



~100 employees. Get a statue at yr 3.






As a library, are concerned about privacy. Know nothing about the visitors of the site, not even if it's a repeat visitor or from where they're visiting. Patron privacy is very important. 



Run own data centers, one in SF & one in Richmond, CA. 






487 billion captures right now. Want to study? We've an 80TB crawl you can have for that. 
Here's what the Wayback Machine looks like, though I suspect if you're in this session you already know this.

How do we decide what to archive?

• People pay us 

• Organizations donate crawled content 

• We crawl on our own behalf 

• Deep crawl on popular sites 

• Broad but shallow crawl on known domains 

• Targeted crawls 



Alexa donates since 1996. Deep crawls on popular sites. Try to do a shallow crawl across every domain they can find. Targeted crawls: every tweeted YouTube link, every outlink from Wikipedia in every 
language, every blog page posted on wordpress.com & all links they point to. Trying to reduce 404s on the internet. YO! MOZILLA! WE WANT TO GET THIS INTO FIREFOX! 



Easiest way to get something into the Wayback. Immediately crawled. Permanent & citable URL. (average webpage changes every 100 days; this snapshots what you want to cite) 



Digitize books! 




Brings in 40% of the budget. Not only does scanning books help us in our mission to provide universal access to human knowledge, it's also important to digitize books even though not an obsolete format 
yet because... 




Scanning center burned down. All had been scanned. Artifacts lost, but knowledge was retained. Believe physical artifacts are important, though... 


1 of 2 warehouses. Physical archive. 




Physical Archive 

1,500,000 books, plus films, VHS and LPs


Doing our damnedest to get a copy of every book. Seed bank for the future. 



Easiest way to find the books on the Archive. Goal: 1 webpage for every book ever published. Can download, borrow (because, duh, we're a library), print disabled access. 
Video

• 2,000,000 items 

• Feature films, documentaries, commercials, 
propaganda, stock footage, cartoons 

• Archive 65 channels of TV 24 hours per day 


TV archive is a large part of the collection, but only news is available right now. 
TV News Archive. 550,000 programs. Search captions, locate a clip, create a citable quote.







Audio 

• 2,500,000 items 

• 150,000 live concerts from 6,000 bands 

• 60,000 netlabel releases from 2,000 labels 

• Mirrors of Jamendo, IUMA 

• Audio books, radio shows, podcasts 
Emulation in browser -> Contact Jason to help.


Why? 

• We are good at storing and serving 
digital media and preserving it 

• We care about the same things: 
knowledge, keeping information open, 
privacy 

• We fight for what we care about 

• We’re not slick, but we are friendly! 


We want you to use the Archive. We care about knowledge and freedom and privacy and access. We want to help you use the Archive. 



Internet Archive APIs 
Let's set up some expectations. Internet Archive provides many different ways to access and contribute to those 24+ Petabytes of data. There's a lot to cover, so I'll only be giving you a brief overview of each rather than a deep dive. Complete information can be found at the Zotero link.


Wayback Machine API 

http://archive.org/help/wayback_api.php


First up: The Wayback Machine API. 



Wayback Machine API 


• Is a URL archived? 

• If so, is it available in the Wayback Machine? 


This API is a study in simplicity & ease: exactly what you need w/o clutter. 



Wayback Machine API 



Simple URL-based API 

Only three possible parameters (url, timestamp, callback) and two are optional 
Returns JSON 



Wayback Machine API 



Parameters for the Wayback API 

* url = what you're looking up (no protocol; http, etc.) 

* timestamp = YYYYMMDDhhmmss; YYYYMMDD OK; will return the closest snapshot to this timestamp

* callback = for jsonp 
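
As a rough illustration of those three parameters, here is a minimal Python sketch that assembles a query URL. The endpoint `http://archive.org/wayback/available` is assumed from the API help page linked above, and the function name `wayback_query_url` is made up for this example.

```python
from urllib.parse import urlencode

# Endpoint assumed from the Wayback API help page; check the docs before relying on it.
WAYBACK_AVAILABLE = "http://archive.org/wayback/available"


def wayback_query_url(url, timestamp=None, callback=None):
    """Build an availability query; only `url` is required."""
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp  # e.g. "20150714" or "20150714120000"
    if callback is not None:
        params["callback"] = callback    # JSONP wrapper function name
    return WAYBACK_AVAILABLE + "?" + urlencode(params)


print(wayback_query_url("solarsystem.nasa.gov", timestamp="20150714"))
```

The resulting string can be handed to curl or to any HTTP client.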
We've had a bit of excitement recently about New Horizons & its fly-by of Pluto, so let's use that as the basis for our examples today.



Wayback Machine API 



Let's see whether the Wayback has their site on record yet. Just throw this to curl and see what it returns... 



Wayback Machine API 


archived_snapshots: {
    closest: {
        available: true,
        url: "http://web.archive.org/web/20150711221849/https://solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf",
        timestamp: "20150711221849",
        status: "200"
    }
}


That's more like it. 
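
Reading that response in code takes only a few lines. This is a sketch, not an official client: the helper name `closest_snapshot` is hypothetical, and it assumes the API returns an empty `archived_snapshots` object when it has no capture.

```python
import json


def closest_snapshot(response_text):
    """Return the closest snapshot's URL, or None if nothing is archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


# Sample payloads mirroring the shapes shown on the slides.
archived = ('{"archived_snapshots": {"closest": {"available": true, '
            '"url": "http://web.archive.org/web/20150711221849/'
            'https://solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf", '
            '"timestamp": "20150711221849", "status": "200"}}}')
missing = '{"archived_snapshots": {}}'

print(closest_snapshot(archived))
print(closest_snapshot(missing))
```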



Wayback Machine API 


{ "archived_snapshots": {} }


This is what you see when there is no entry for the URL in the Wayback. This API is in use in some pretty interesting places. One example is: 



404 Handler 

Free "404: File Not Found" Handler for Webmasters to Improve User Experience

The Internet Archive today is launching a free service to help webmasters improve their user experience by augmenting their website's 404 Page Not Found page to link to the Wayback Machine in the case that it has it.

Therefore users trying to get to any pages that might have been on a previous version of your website will in turn be given the option to go to the Wayback Machine.

To embed a link to the Wayback Machine on your site's 404 pages, just include this line in your error page:

https://blog.archive.org/2013/10/24/web-archive-404-handler-for-webmasters/


The 404 Handler. Placed on your 404 page, if an archived version of the page is in the Wayback then it'll offer a link to it. 


Open Library API 


https://openlibrary.org/developers/api


Open web page for every book ever published. Think of it rather like Wikipedia for books. The library you can edit. 
A full RESTful API. 

Can return Open Library bibliographic & holdings records in JSON and RDF formats 



Open Library API 


• Query the Open Library database 

• View record information 

• Edit record information 

• View record history 


Features: most everything you can do on the Open Library website: Query the DB, look at information for books & authors, edit records and view edit history... 



Open Library API 


http://openlibrary.org/subjects/pluto.json?limit=1


Let's see what Open Library has for the subject "pluto". I'll limit it to just one entry so it'll fit on the next slide. 
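
For anyone who would rather script this than type URLs, here is a small Python sketch. The function names are invented for this example, and the field names (`works`, `title`, `authors`, `name`) are taken from the response shown on the next slide rather than from the full API documentation.

```python
from urllib.parse import quote, urlencode


def subject_url(subject, limit=None):
    """Build an Open Library subjects URL such as .../subjects/pluto.json?limit=1."""
    url = "http://openlibrary.org/subjects/" + quote(subject) + ".json"
    if limit is not None:
        url += "?" + urlencode({"limit": limit})
    return url


def titles_and_authors(subject_data):
    """Yield (title, [author names]) for each work in a parsed subjects response."""
    for work in subject_data.get("works", []):
        names = [author["name"] for author in work.get("authors", [])]
        yield work["title"], names


print(subject_url("pluto", limit=1))
```

Pass the parsed JSON (for example from `json.loads`) to `titles_and_authors` to list what came back.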



Open Library API 


{
    works: [
        {
            printdisabled: false,
            cover_id: null,
            ia_collection: [],
            has_fulltext: false,
            edition_count: 2,
            checked_out: false,
            title: "The planets",
            public_scan: false,
            lendinglibrary: false,
            lending_edition: "",
            overdrive: "",
            first_publish_year: null,
            key: "/works/OL2715443W",
            authors: [
                {
                    name: "Heather Couper",
                    key: "/authors/OL397296A"
                },
                {
                    name: "Nigel Henbest",
                    key: "/authors/OL449922A"
                }
            ],
            ia: null,
            lending_identifier: "",
            subject: [
                "Solar system",
                "Juvenile literature",
                "Planets"
            ]
        }
    ],
    subject_type: "subject",
    work_count: 13,
    key: "/subjects/planets",
    name: "planets"
}
So I see there's "The Planets" by Heather Couper & Nigel Henbest.



Evergreen 
http://evergreen-ils.org/


An example of this API in the wild: The Evergreen open source Integrated Library System uses the API to retrieve book covers and other information from OL. 


Do-We-Want-It? API


http://want.archive.org/


So, how do all of those books get into Open Library? Well, a lot of them are donated by people like you. The Archive will accept your spare book, scan it & make it available online, and then save the book 
itself in its physical archive. But space is limited so they've provided a simple API to help you see whether they have a copy of that book yet. 



Do-We-Want-It? API


http://want.archive.org/api?isbn=<isbn10 or isbn13>


This API works with anything which has an ISBN. It takes one argument and returns very easy to understand JSON...



Do-We-Want-It? API


http://want.archive.org/api?isbn=978-0393350395


So let's see whether the Archive needs any copies of this book by Neil deGrasse Tyson & Donald Goldsmith...


Do-We-Want-It? API


{
    status: "success",
    result: "1",
    description: "want_for_ia_pa",
    identifier: "-1"
}



Yup! We can see here that the Archive would like to have a copy of this book. It doesn't yet have it available. 



Do-We-Want-It? API


Response keys:

status
    fail    - We failed to process the request. The submitted ISBN was invalid.
    success - The request was serviced successfully.

result
    -1 - We failed to process the request. The submitted ISBN was invalid.
     0 - The request was serviced successfully, but we have two copies, and do not want more.
     1 - The request was serviced successfully, and we have no copies. We want it.
     2 - The request was serviced successfully, and we have 1 copy already. We want it.

description - "human-parsable" description, with respect to the above:
    failure
    do_not_want
    want_for_ia_pa
    want_for_alt_pa

identifier - String when result = 2. This is the identifier already assigned on the cluster for that ISBN.
    -1       - No identifier designated for this ISBN.
    <string> - Designated identifier for this ISBN.
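
A program only needs to look at `status` and `result`. The sketch below is one way to do that in Python; `wants_book` and `RESULT_MEANINGS` are names invented here, and the mapping assumes the result codes -1, 0, 1 and 2 pair with the four descriptions in the order the slide lists them.

```python
import json

# Result codes, paired with the slide's descriptions in listed order (an assumption).
RESULT_MEANINGS = {
    "-1": "invalid ISBN, request failed",
    "0": "two copies held, not wanted",
    "1": "no copies held, wanted",
    "2": "one copy held, still wanted",
}


def wants_book(response_text):
    """Return True if the Archive wants the book, False if not, None on failure."""
    data = json.loads(response_text)
    result = str(data.get("result"))
    if data.get("status") != "success" or result == "-1":
        return None
    return result in ("1", "2")


# Sample payload copied from the example response in the talk.
sample = ('{"status": "success", "result": "1", '
          '"description": "want_for_ia_pa", "identifier": "-1"}')
print(wants_book(sample), RESULT_MEANINGS["1"])
```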
The results are pretty easy to read, but here you can see all of the possibilities for data returned.



IA Search API 

https://archive.org/advancedsearch.php


As Alexis has pointed out, the Archive has ALL THIS GREAT STUFF. But how do you find it? This isn't a documented API so much as an easily extrapolatable URL format. 



IA Search API 


Advanced Search 

This form allows you to perform an advanced search. You only need to fill in one field below. This can be any field. If you select "not" as your match criteria, you must select one other field.

So let's continue my quest to learn more about planets, performing a search for all texts with a subject value of 'pluto'.


IA Search API 


The search terms:
mediatype:(texts) AND
subject:(pluto)

https://archive.org/search.php?query=mediatype%3A%28texts%29%20AND%20subject%3A%28pluto%29

It returns this URL. It has some URL (percent) encoding going on, but otherwise makes it pretty obvious how to build a URL. OK, that's pretty cool, but...



IA JSON API 


Advanced Search returning JSON, XML, 
and more 



https://archive.org/advancedsearch.php?q=mediatype%3A%28texts%29+AND+subject%3A%28pluto%29+AND+collection%3A%28nasa%29&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback


...the fact that I can ask for the exact same data to be returned as JSON (or CSV, or XML, or...). As you can see, this URL is a bit busier, but comparing it against the form it's quite easy to see what's going on.
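
Since the URL format is this regular, building it in code is straightforward. A minimal Python sketch follows; `advanced_search_url` is a made-up helper, and it only covers the parameters visible in the URL above (`q`, `fl[]`, `rows`, `page`, `output`), leaving out the empty `sort[]` entries and the JSONP `callback`.

```python
from urllib.parse import urlencode


def advanced_search_url(query, fields=("identifier",), rows=50, page=1, output="json"):
    """Build an advancedsearch.php URL for a Lucene-style query string."""
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]  # one fl[] per returned field
    params += [("rows", rows), ("page", page), ("output", output)]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(advanced_search_url("mediatype:(texts) AND subject:(pluto)"))
```

`urlencode` takes care of the percent encoding, so the query can be written the way it appears in the search form.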


IA JSON API 


callback(
{
    responseHeader: {
        status: 0,
        QTime: 108,
        params: {
            json.wrf: "callback",
            wt: "json",
            rows: "50",
            qin: "mediatype:(texts) AND subject:(pluto)",
            fl: "identifier",
            start: "0",
            q: "mediatype:(texts) AND subject:(pluto)"
        }
    },
    response: {
        numFound: 32,
        start: 0,
        docs: [
            {
                identifier: "nasa_techdoc_20050060913"
            },


But it's even easier to see when you view the JSON output. Note the identifier.



IA JSON API 

• Lucene-based 

• Grouping 

• Fuzzy queries 

• Relevance boost 

• Date ranges 

• Etc. 


You can do some pretty sweet things here. But wait, there's more! 



IA Metadata API 

http://blog.archive.org/2013/07/04/metadata-api/


The Internet Archive Metadata API allows you to DATA MINE that entire 24+ Petabytes of data. And it's WAY faster than it has any right to be. 



IA Metadata API 


http://archive.org/metadata/nasa_techdoc_20050060913/metadata


I want to learn more about that item I retrieved with the JSON API, so let's call the Metadata API on its identifier. To make the response slightly shorter, I'm going to limit it to just the most metadata-est part 
of the information, rather than ALL of it. 



IA Metadata API 



result: {
    identifier: "nasa_techdoc_20050060913",
    date: "2004",
    description: "Terra MODIS 250 m observations are being applied to a Suspended Sediment Concentrat
    algorithm that is under development for coastal case 2 waters where reflectance is dominated by sedin
    entrained in major fluvial outflows. An atmospheric correction based on MODIS observations in the 50C
    resolution 1.6 and 2.1 micron bands is used to isolate the remote sensing reflectance in the MODIS 25
    resolution 650 and 865 nanometer bands. SSC estimates from remote sensing reflectance are based on ac
    inherent optical properties of sediment types known to be prevalent in the U.S. Gulf of Mexico coastc
    present our findings for the Atchafalaya Bay region of the Louisiana Coast, in the form of processed
    over the annual cycle. We also apply our algorithm to selected sites worldwide with a goal of extendi
    utility of our approach to the global direct broadcast community.",
    document-source: "CASI",
    documentid: "20050060913",
    nasa-center: "Goddard Space Flight Center",
    online-source: "http://wayback.archive-it.org/1792/20100127084754/http://hdl.handle.net/2060/2005
    original-nasa-rights: "Unclassified; No Copyright; Unlimited; Publicly available; Progress Report
    title: "Estimating Coastal Turbidity using MODIS 250 m Band Observations",
    updated-added-to-ntrs: "2008-06-02",
    year: "2004",
    collection: "nasa_techdocs",
    contributor: "NASA",
    language: "eng",
    licenseurl: "http://creativecommons.org/licenses/publicdomain/",
    mediatype: "texts",
    rights: "Public Domain",


And we get more JSON. There's a LOT of data here and I've truncated the output here. 
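
To show how little code a lookup like this needs, here is a hedged Python sketch. Both helper names are hypothetical; the URL layout and the `result` wrapper are taken from the example above, not from the full API documentation.

```python
import json


def metadata_url(identifier, *path):
    """Build a Metadata API URL, optionally narrowed to a sub-path such as 'metadata'."""
    return "/".join(("http://archive.org/metadata", identifier) + path)


def pick_fields(response_text, *keys):
    """Pull selected keys out of the 'result' object of a sub-path response."""
    result = json.loads(response_text).get("result", {})
    return {key: result.get(key) for key in keys}


print(metadata_url("nasa_techdoc_20050060913", "metadata"))

# A cut-down stand-in for the response shown on the slide.
sample = ('{"result": {"title": "Estimating Coastal Turbidity using MODIS 250 m '
          'Band Observations", "year": "2004", "mediatype": "texts"}}')
print(pick_fields(sample, "title", "year"))
```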



IA Metadata API 

It can write metadata, too! 


If I had the authorizations to change this item's metadata. Which I don't. But if you do, then you can if you want. wOOt. 



But Wait! 
There’s more! 


So right now you're probably thinking, "Sure, all this mining of 24+PB of data is neat and all, but how do I add to it?"



IAS3 API 


https://github.com/vmbrasseur/IAS3API


This is the big daddy: The Internet Archive S3-like API. And now you know why I'm up here speaking to you today: I'm the maintainer of the documentation for this API, which you can find at this GitHub URL.


Reminder: You can upload ANYTHING to IA. For free. As much as you want. They'll serve it up & preserve it forever. For free. 



IAS3 API 


Create items in Internet Archive, upload 
files to those items, maintain the metadata 
for the items, and download from any 
publicly-available Internet Archive items. 


Doesn't work for Open Library, but otherwise? Much of that stuff I just showed you? Can be handled with this one Big Daddy of an API. 



IAS3 API 


• Drop-in replacement for the Amazon S3 API 

• Pick your favorite S3 library/client, change the 
server to s3.us.archive.org, and you’re good to go. 


This is a pretty involved API, so I'll only provide one brief and simple example. There are more in the documentation. 



IAS3 API 


curl --location --header 'x-amz-auto-make-bucket:1' \
     --header 'x-archive-meta01-collection:nasa_techdocs' \
     --header 'x-archive-meta-mediatype:movie' \
     --header 'x-archive-meta-title:Pluto Fly By' \
     --header "authorization: LOW $accesskey:$secret" \
     --upload-file new-horizon.mp4 \
     http://s3.us.archive.org/pluto-new-horizon/new-horizon.mp4




Create a new item (aka bucket) on Internet Archive with the identifier pluto-new-horizon, set its collection, mediatype and title metadata, then upload the file new-horizon.mp4 to the item.
Can also download, change metadata, etc. A lot of people & organizations use this API, so I'll only highlight a very few of them.
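
The same request can be prepared from Python. This sketch only assembles the URL and headers seen in the curl command; the function name is invented, the credentials are placeholders, and sending the actual PUT (with an S3 client or any HTTP library) is left out.

```python
def ias3_upload_request(identifier, filename, access_key, secret, metadata):
    """Return (url, headers) for a PUT that creates an item and uploads one file."""
    headers = {
        "x-amz-auto-make-bucket": "1",  # create the item if it does not exist
        "authorization": "LOW {}:{}".format(access_key, secret),
    }
    for name, value in metadata.items():
        # Each metadata field travels as its own x-archive-meta-* header.
        headers["x-archive-meta-" + name] = value
    url = "http://s3.us.archive.org/{}/{}".format(identifier, filename)
    return url, headers


url, headers = ias3_upload_request(
    "pluto-new-horizon", "new-horizon.mp4",
    access_key="ACCESSKEY", secret="SECRET",  # placeholders, not real credentials
    metadata={"mediatype": "movie", "title": "Pluto Fly By"},
)
print(url)
for name in sorted(headers):
    print(name + ": " + headers[name])
```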
https://archive.org/details/usfederalcourts

https://archive.org/details/publicsafetycode


RECAP from Aaron Swartz & Global Public Safety Codes from Carl Malamud to free otherwise locked up public information. 


But, IMO, the most exciting use of IAS3API is by NASA. 



https://archive.org/details/nasa 


Life is short. 
What if I don’t want to 
learn S3? 


Sure, you think that's cool & all, but your time is valuable and you really don't want to spend it learning S3? OK, we can work with that. 



ia-wrapper


https://github.com/jjjake/ia-wrapper 


We have Jake Johnson at the Archive to thank for this little wonder. This is a Python wrapper around IAS3 and the metadata and search APIs. It includes utilities for everything you want to do, without all the 
mess of wrangling S3 API headers. As if that weren't good enough... 



iadownload 


#!/usr/bin/env python

# iadownload: Download all files in a collection or item

# Copyright 2014 VM Brasseur

import os
import sys

import internetarchive
import pprint
import argparse
import json


https://github.com/vmbrasseur/iadownload 



It also includes a Python library, so you can build your own utilities and services. As you can see, I use this myself. Not only do I use it for downloading... 
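
To give a flavor of what building on that library looks like, here is a rough sketch of a collection downloader. It is not the iadownload source. It assumes the `internetarchive` package's `search_items` and `download` functions, and the helper names and the `search`/`download` parameters are made up so the logic can be tried without the library or a network connection.

```python
def collection_query(collection):
    """Search query matching every item in a collection."""
    return "collection:({})".format(collection)


def download_collection(collection, destdir=".", search=None, download=None):
    """Download every item in a collection; returns the identifiers fetched."""
    if search is None or download is None:
        # Imported lazily so the helpers above work without the library installed.
        import internetarchive
        search = search or internetarchive.search_items
        download = download or internetarchive.download
    fetched = []
    for hit in search(collection_query(collection)):
        download(hit["identifier"], destdir=destdir)
        fetched.append(hit["identifier"])
    return fetched


# Usage (needs `pip install internetarchive` and network access):
# download_collection("sfperlmongers", destdir="videos")
```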
https://github.com/vmbrasseur/iaupload


http://archive.org/details/sfperlmongers


I also use it to upload. I'm the organizer of the San Francisco Perl Mongers user group. We now record all of our events and upload the videos to the Archive for all to see. But we don't really have a ton of 
material. To really put ia-wrapper through its paces we need to look to... 


Saving All The Things 




Jason Scott. Internet Archive employee, Founder of Archive Team, and activist computer archivist. As of the writing of this talk, Jason has uploaded just shy of 300K items to the Archive. Most of the items 
contain several files. Jason uses and swears by ia-wrapper to help him archive as much of computer history as inhumanly possible. 




As you can imagine, there's a WHOLE lot more which could be said on this topic. But now you at least know where to start looking for more information. Let's recap: 




Those Links Again... 


• https://github.com/internetarchive 

• https://archive.org/details/lca 201 6 -ia-apis 

• https://www.zotero.org/groups/internet_archive_-_open_apis_and_examples/items


As promised, here are those links again. Snap a picture or find the slides at the IA item (last link). We have one more important link to share with you... 



Donate to Internet Archive! 


http://archive.org/donate/


Your support helps us build amazing services and keep them free for people around the globe. 



THANK YOU! 


https://archive.org/about/jobs.php 

We need programmers to help us change (and preserve) the world!

VISIT US 

Free lunch Fridays at noon 
300 Funston Ave 
San Francisco 

alexis@archive.org 


Questions? 



